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• ' Abstract 

o, 

(^ , We consider the augmented Lagrangian method (ALM) as a solver for the 

fused lasso signal approximator (FLSA) problem. The ALM is a dual method 

^ , in which squares of the constraint functions are added as penalties to the 

C^ ■ Lagrangian. In order to apply this method to FLSA, two types of auxiliary 

variables are introduced to transform the original unconstrained minimization 
problem into a linearly constrained minimization problem. Each updating in 
this iterative algorithm consists of just a simple one-dimensional convex pro- 
gramming problem, with closed form solution in many cases. While the existing 
literature mostly focused on the quadratic loss function, our algorithm can be 
easily implemented for general convex loss. The most attractive feature of this 



algorithm is its simplicity in implementation compared to other existing fast 
solvers. We also provide some convergence analysis of the algorithm. Finally, 
the method is illustrated with some simulation datasets. 
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1 Introduction 



In th is paper we examine the one- dimensional fussed lasso signal approximator (jTibshirani et al.l . 
20051 ). which is to solve 



min f{(3) = Fiy, /?) + Ai J^ |A| + A2 J^ |A - A_i|, (1) 

4 = 1 1=2 

where y = {yi, . . . ,yn) are the noisy observations, Ai, A2 > are two regularization 
parameters and F{y,(3) = Yll=i^ii.Pi^yi) is the loss function. The most frequently 
appearing case is the quadratic loss Fi{(3i,yi) = {yi — /3j)^/2, for which there exists 
several solvers. Here we also consider the more general case where Fi is a convex 
and coercive function of /3j. Note that by definition the coercive function F^ satisfies 
lim|^^l_j.oo -^i(/3i,l/i) — > 00, for all j v,; £ R, which i s use d to ensure the existenc e of th e 



minimizer. As demonstrated in 



Huang et al 



(120051); 



Tibshirani and Wang! fcoOSh 



an important application of FLSA is the reconstruction of copy numbers from CGH 
arrays. 

Several algorithms h ave been proposed for FLSA, including a specially designed 



quadratic programming (JTibshirani et al. 



nate descent and fusion algorithm (JFriedman et al. 



2005 



Tibshirani and Wang 



20081), coordi- 



20071 ) and a pat h algorithm tha t 



solves the problem for all regularization pa rameters simult aneously (JHoeflingl . I2OIOI ). 



Based on the numerical results performed in 



(12010[ 1 , the latter two algorithms 



are clearly very fast and efficient and represent the state of the art. However, these two 



algorithms require substantial efforts in implementation for non-expert programmers, 
since one needs to keep track of the "fused sets" wh ich contains the coeffi cients /3j 

(120071 ) has the 



Friedman et al. 



that assume the same value. Besides, the algorithm of [ 

disadvantage that once the coeff icients are fused, t he linkage cannot be removed later 

(a similar problem is noticed in IZou and Lil ( l2008l ) for the locally quadratic approxi- 



mation algorithm proposed in variable selection problem with non-conca ve penalty). 



and no convergence analysis is available. The algorithm of lHoeflingI ( 120101 ) is designed 
to solve (II]) for all regularization parameters but it does not work for general conve x 
loss since the solution path is in general not piecewise linear ( iRosset and Zhul . 120071 ) . 
Here we c onsider augmen ted L agrang i an me thod (ALM) which was independently 
developed by lHestened (119691 ) and lPowelll (119691 ) almost half a century ago, which aims 



to solve convex optimization problem with linear constraints. There are surged inter- 



ests recently in applying this method in different optimization 



2009 



Tao and Yuan 



2010 : 



Wen et al. 



2009; 



problems ( 



Yang and Zhang 



2009 



Tai and Wu 



Yang and Yuan 



20101 ). We will show that after some simple transformations of ([T]), the ALM can 
be applied to efficiently solve FLSA with general loss functions. The most attractive 
feature of the method is its simplicity in implementation. We present our R code 
for solving ([1]) with quadratic loss in Appendix B in the Supplementary Material, in 
which the main iterations consist of only about 20 lines of commands. We provide a 
clear self-contained convergence analysis of ALM in our context (Appendix A in the 
Supplementary Material) following existing ideas. Our algorithm can be initialized 



es sentially arbitrar i. 



of 



Friedman et al. 



y, in p articular i nitial ized with zero values, while for algorithms 



(120071 ): 



HoefiingI (120101 ) such initialization will not work and the 



coefficients will stay at zero at all times. 



2 Augmented Lagrangian Formulation 

By introducing the auxiliary variables 9i,i = 2,...,n, the following linearly con- 
strained problem is trivially equivalent to ([1]). 



min g{P,9) = X^F,(|/„ A) + Ai ^ |A| + Aa^ |^. 



1=1 1=1 

s.t. 6i = f3i- (3i_i,i = 2,. . . ,n. 



i=2 



Following iGlowinski and Le Tailed (1l989l ). we define the augmented Lagrangian, for 
c> 0, by 



£,(/?, 6, v) = g{/3, e) + J2 ^^(.^^ - A + A-i) + I Y.^e, - A + A-l)', 



i=2 



1=2 



where z/ = (z/2, . . . , Vn) is the Lagrange multiplier. 
We consider the following saddle-point problem, 



Find I3*,e*,u*, 
s.t. C,{/3*, e\u)< £e(/3*, r, z/*) < £,(/3, 6, u*), V /3, 6, v. (2) 



The proof for the following i s wel 



1970 



Ekeland and Turnbulll . 



19831) 



known from classical duality theory (JRockafellar 
and is thus omitted. 



Proposition 1 /3* zs a solution of (QP if and only if (/3*,^*,z/*) is a solution of Ij^ 
for some 9* and v* . 

The basic algorithm f or fin ding the saddle point is the following Algorithm 1 



( JGlowinski and Le Tailed . 119891 ). 



Algorithm 1 


initialize i/°, arbitrarily. 




For fc = 1, 2, . . . 




(/3^^^) = argmin(;3,,)£,(/3,0,z/'=-l) 




i.f = ^f-^ + c(^f-/3f + /3ti),z = 2,.. 


.,n 



In general, it is difficult to minimize Cc{(3, 6, v^) over /3 and 9 simultaneously, but 
it might be easier to minimize over /3 when fixing 9 and vice versa. In this case, we 
can alternate these two steps until convergence. It turns out that we can update /3 
and 9 just once when the other is fixed, resulting in the following algorithm. 



Algorithm 2 


initialize v^ and 9^, arbitrarily. 




For A; = 1, 2, . . . 




/3fc = argmin;3 £c(/3, ^^"S i^^~^) 




9'' = argmmeCci(3'',9,u''-^) 




z.f = i.f-^ + c(ef-/3f + /3ti),^ = 2,.. 


.,n 



Example. We apply Algorithm 2 to ([H) with quadratic loss. In this case, the 
augmented Lagrangian is 

-. n n n n 



i=2 



i=2 



4=2 



If Ai = 0, given 9^~^ and z/'^"^, the minimization over /3 is a simple quadratic problem 
and all components of /3 can be found simultaneously by solving a linear system 
Bfi = b, where we do not write down explicitly the expression of matrix B and vector 
b, but note that due to the special structure of the problem, i? is a tridiagonal matrix 
and there exists an efficient a lgorithm with comp l exity linear in n for solving the 
linear system (see for example iConte and De Boorl ( 19801 ) ). 



For Ai > 0, it is more difficult to update (3 directly. Fortunately, for quadratic 
loss, solution for FLSA with Ai > can be obta i ned by thresholding the solution for 
FLSA with Ai = as shown in 



Friedman et al. 



(J2007l ). and thus (for this example) 



we only consider Ai = 0. 

With P = P^ and v = v^~^ fixed, the minimization over ^ is a simple lasso 
regression with orthogonal design and thus we have the simple component-wise soft 
thresholding updating rule 

e^^=sign{ei)m-\2/c)+, (3) 

where 6i = /3f — [5f_i — v\~^ jc and (a)+ denotes the positive part of a. D 

For quadratic loss, the example shows that both update for /3 and for Q can be 
computed efficiently for Ai = 0. However, for more general loss and/or for Ai > 0, 
it is difficult to update /3 directly and thus in our implementation we do not use 
Algorithms 1 and 2. We propose next a further augmentation step that decouples the 
quadratic term (6'j — /3j + /3i-i)^ with the loss function. 

We introduce another set of auxiliary variables 7j, z = 1, . . . , n and consider the 
following problem which is still obviously equivalent to ([I]). 

n n n 

min g{^,P,e) = VF,(y„7i) + Ai V \-fi\ + A2 V |^i| 

'y.p.u ' ^ ' ^ ' ^ 

i=l i=l i=2 

s.t. 7i = /3j, z = 1, . . . , n, Oj = Pj - (3j^i,j = 2, . . . , n. 
The corresponding (doubly) augmented Lagrangian is 



£,(7,/3,^,/i,z/) = ^(7,/3,^) + ^/i,(7,-A) + ^^(7,-A)2 (4) 



2 



n n 

C 



+ Y. ''^(^^ -f^^ + /^-l) + 9 5Z(^* -f^i + f^i-^'- 



2 



2 

i=2 i=2 



In the above, the coefficients for both quadratic penalties are the same (equal to c/2). 
In principle, we can use different coefficients but computationally it is difficult to tune 
both parameters and thus we settle with this simpler choice. 

With the newly defined Lagrangian in (|4]), we can similarly modify the saddle- 
point problem ([2]) in an obvious way and it can be shown that the saddle-point 
problem is the same as the original FLSA problem ([T]). Accordingly, we have the 
following algorithms for finding the saddle point which directly extends Algorithm 1 
and Algorithm 2 respectively. 



Algorithm 3 


initialize u'^, 


arbitrarily. 








For k = l,2, 












(7^/3^^'= 


) = arg 


min(7,/3,0) >Cc(7, /?, 0, 


/- 


-\iy''- 


'') 


^f = ^f-^ 


+ c{ef - 


-/3f + /3ti),2 = 2, 




,n 




^^^ = f^r 


+ chf 


-/3f),z = l,...,n 









Algorithm 4 


initialize i/°, /3° and 6'°, arbitrarily. 






For k = 


= 1,2,... 






l' = 


= argmin^£,(7,/3'=-i,^'=-i,/i'^- 


-i,z/^- 


') 


15'' = 


= argmin;3 >Cc(7^ /?, ^''"S /"^ 


^fc-i) 




Q^ = 


argmine LSrl" , P'', 6, fi'"'^, u'' 


-') 




-f = 


-.ut' + c{ef-P^ + (3t,),z = 


2,... 


,n 


l4- 


--^it' + c{lf-/3^),^ = l,... 


n 





In Algorithm 3, arg min(^_/3_0) £c(7, /?, &, jJ^^ ^, t-'^ ^) is typically difficult to find di- 
rectly and iterative updating of each one of them with others fixed is applied (i.e.. 



repeat the first three steps in the loop of Algorithm 4 until convergence). We will use 
simulation later to compare the relative efficiency of Algorithm 3 and Algorithm 4. 

We now consider each update in detail. Note the doubly augmented Lagrangian 
is 

n n n n n 

i=l i=l i=2 4=1 j=l 

n n 

+ J2 "^^^^ -f^^+ /^-i) + 1 J2(^^ -f^^+ /^-i)'- 

i=2 i=2 

The update for /3 can be performed in closed form by solving a linear system, which 
still involves a tridiagonal matrix and can be solved efficiently. Note that the effect of 
introducing 7^ is to decouple some terms in the Lagrangian so that the loss function 
and the lasso penalty do not come into play when updating /3. The update for 6 is 
the same as before and can be performed with component-wise thresholding using 
the same formula ([3]). The update for 7 is generally not available in closed form. 
However, due to the special separable structure of the functional, it can be updated 
component by component, resulting in multiple one- dimensional convex optimization 
problems for which many efficient solvers exist. For different convex loss, only the 
updates for ji need to be modified. We also note that for the quadratic loss, the 
updates for 7, is also a simple soft thresholding. 

Example. In this example we take Fi{yi,'yi) = |yi — 7j|, the absolute deviation 
or Li loss. The Li loss function is an interesting alternative to the quadratic loss 
in that it is more robust to outliers. We refer to the resulting FLSA problem ([T]) 
with Li loss as LAD-FLASSO. In this case, the update of 7^ consists in minimizing 
IVi — 7i| + Ai|7i| + /ii(7j — f3i) + c/2(7j — /3j)^. Although the solution is not available in 
closed form, the function is strictly convex and differentiable except at two points, 
and t/j. Thus the minimizer can be found by comparing its values at 0, yi, and other 



potential stationary points, a total of only six cases (by considering the sign of 7^ and 
yi — 7i). Thus the update in 7 can also be found efficiently and implemented easily. 
D 

In the following theorem, we give the convergence of Algorithms 1-4. It shows 
that P^ is a minimizing sequence of the FLSA ([1]). If the minimizer is unique, then 
P^ converges to the minimizer. The proof of the theorem is given in Appendix A in 
the Supplementary Material. 

Theorem 1 For any of the algorithms 1-4, we have f{(3^) — )■ min^/(/3) where f is 
the FLSA functional defined in (Qp. 

3 Simulation Results 



We follow the similar simulation setups used in iHoeflingI (120 lOl ). Each simulated se- 
quence consists of data points with values of 0, 1, 2 and roughly 20% of the data 
points have value 1 and another 20% have value 2, with Gaussian noises added (ex- 
cept in Experiment 4 below where noise with heavy-tailed distribution is used). In 
experiments 1-3 below, we restrict ourselves to quadratic loss functions. The exper- 
iments are performed on HP workstation xw4400 with Intel Core 2 Duo Processor 
2.66GHz and 2GB of RAM, implemented in R. We also make use of the limSolve pack- 
age in R which implemented the tridiagonal matrix algorithm. We apply our doubly 
augmented Lagrangian method to the simulated dataset. The different between algo- 
rithm 3 and algorithm 4 is that algorithm 3 has an additional inner loop that applies 
the first three updatings in the loop of Algorithm 4 repeatedly till convergence. 

Experiment 1. First we study the effect of the number of iterations, T, performed 
in this inner loop. Thus Algorithm 3 corresponds to the case T — ?■ 00 while Algorithm 
4 corresponds to T = 1. In this experiment, we set the sequence length n = 200 and 



Table 1: Simulation results for Experiment 1 for varying the number of iterations of 
the inner loop. 







n=200 






n=2000 






T=l 


T=2 T=5 


T=10 


T=l 


T=2 T=5 


T=10 


number of iterations 
computation time 


226.95 
0.117 


131.61 72.70 
0.118 0.144 


69.09 
0.258 


212.02 
0.354 


147.84 78.56 
0.654 0.838 


71.21 
1.753 



n = 2000, with Gaussian noise of variance 0.1, c = 5 and Ai = 0.5, A2 = 4. 100 
datasets are simulated in this experiment. The convergence criterion used is Hz^'^ — 



u 



fc-lll I \\,,k ,,fc-l 



+ Wfi^ — fi'' ^11 < 10 ^°. In Table [H we show the average number of iterations 



required till convergence as well as the time (in seconds) elapsed. We see that although 
using T > 1 reduced the number of iterations (for the outer loop) required, the overall 
computation time is either similar to the case with T = 1 or significantly increased 
even for small value of T. Thus we see no advantage of using T > 1 and Algorithm 4 
is adopted in the following. We have also conducted simulations using other sequence 
lengths and parameters and the conclusion is the same. 

Experiment 2. Next we consider the effect of the parameter c on the convergence of 
the algorithm. Although theoretically the ALM converges in the limit for any c > 0, 
we will see that this parameter can affect the speed of convergence. Our simulation 
involves a sequence of length n = 1000 with A^(0, 0.1) noises, and we solve the FLSA 
problem with Ai = and A2 = 0.1. We choose many different values for c and the 
evolution of the mean squared error J2^^ii(^i ~ PiY /''^ fo^ fi^^ values of c is plotted 
in Figured] (a). Here /3j represents the true signal and /3f is the estimate for the fc-th 
iteration. We see that for small value of c, the convergence of the estimate is slow 
and oscillate in the initial stage, while for big values of c, the convergence is also slow. 
For this sequence, a value between 0.5 and 5 generally produces reasonable speed of 
convergence visually. 

Now we vary different parameters involved in the optimization to investigate how 



10 



these changes affect the choice of c. First we generate a sequence of length n = 100 
and another with n = 10000. The convergence diagnostic plots are shown in Figure 
[1] (b) and (c). Remarkably, the plots show that the choice of c is almost unaffected 
by the length of the sequence and the number of iterations required for convergence 
does not depend on n . This empirical observation has at least two implications, (i) 
In order to choose a reasonable value of c for an extremely long sequence, we can run 
the algorithm on a subsequence with several different values of c and choose the best 
one based on the convergence speed on the subsequence. Of course for this to work 
we need to assume the sequence is stationary in some sense, (ii) The complexity of 
the algorithm is linear in the length of the sequence since it is linear for each iteration 
and the number of iterations does not vary much with the length (note this is only 
based on empirical observation). 

Then we use the same sequence with n = 1000 but with a bigger noise variance 
0.4. The plot shown in Figured] (d) looks different, but the range of values for c that 
results in fast convergence is similar as before. In Figured] (e), we show the results 
when solving FLSA with Ai = 0.5 and A2 = 4, and in Figure d] (f) , (g) , we multiply and 
divide the noisy sequence by a factor of 10 respectively. When these parameters are 
changed, we see the trace plot is more variable. Since all these types of changes can be 
regarded as the change in relative sizes of the different terms in (jlj), we conclude that 
the choice of c depends on this relative scale but is quite stable otherwise. Even so, we 
still observe that the optimal choice of c is somewhere between 0.2 and 10. We have 
generated different sequences and worked with different regularization parameters to 
make sure the observations made above apply to a wide variety of settings. Since our 
algorithm is relatively fast, we can suggest running the algorithm for several different 
values of c and visually check its convergence, except when the sequence is extremely 
long (> 10^) and then we can run the algorithm on one or more subsequences to 
choose c before running it on the entire sequence. In all the following experiments we 

11 



set c = 5. 

Experiment 3. Here we want to say something about the computation speed of 
our algorithm based on comparisons with previous approaches. We download from 
C RAN the fl sa pa ckage (version 1.03) which is based on the path algorithm presented 



m 



HoeflingI (120ld ). For each case of n = 100, 1000, 10000, 100000, we generate 100 



sequences and the average computation times for each sequence are shown in Table 
[2] for Ai = 0.5, A2 = 4. From the results reported in the table, we see that the path 
algorithm is about 20 times faster than our ALM algorithm for n > 1000. For the 
case n = 100, the difference is about 100 fold. We observe from Table [2] that both 
algorithms have computation time approximately linear in n, except for ALM when 
n = 100. Thus the large difference for n = 100 may be due to the reason that in 
this case most of the computation time in ALM is spent on ancillary chores such 
as calling the R function, setting up parameter values and returning results. The 
reported comp utation time for the path algorithm is slower than those reported in 



HoeflingI (J2010[ ) and the reason might be due to the difference in simulation setup and 



difference in computer system used. 

We also need to note that the path algorithm is specifically designed for computing 
the entire solution path for all regularization parameters, and in this sense it should 
have even better performance when the solution for many regularization parameter 
values are sought. However, this algorithm does not work with general loss function 
as the ALM does. 

There is no publi cly available package i mplemen t ing t he descent algorithm in 



Friedman et al 



(l2007f ). However, Table 1 in iHoeflingI ( 120101 ) reported that the path 
algorithm is about 10-100 times faster than the descent algorithm when n < lO'*, while 
the two algorithms have comparable speed with larger n. Based on this comparison, 
we think our algorithm is probably comparable to the descent algorithm when n < 
10^ but much slower for bigger sequence length. Finally, we note it is difficult to 
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Figure 1: Evolution of mean squared error of reconstructed signals with iterations. 



13 



Table 2: Comparison of computation speed for our ALM algorithm and the flsa 
package based on the path algorithm. 

n = 100 n= 1000 n = 10"^ n = 10^ n = 10^ 
ALM 0.09811 0.1797 1.861 21.702 223.9 
flsa 0.00092 0.0073 0.072 0.958 10.51 



exactly compare the computation speed for different algorithms since all algorithms 
involve some parameter choice. In particular, the convergence criterion used in our 
implementation is Hz/*^ — z/^^^^H + ||;u'^' — /x'^"^!! < 10~^°, and if we increase the threshold 
to 10~^, it becomes 3 to 5 times faster. Besides, the fisa package uses C code in 
its underlying implementation which makes it faster, and our implementation uses 
tridiagonal matrix algorithm from the limSolve package which uses Fortran code in its 
implementation and thus the net effect is difficult to compare. We emphasize again 
that the biggest advantage of our algorithm is the ease in implementation as well as 
that it works with general convex loss functions. 

Experiment 4- Finally, in this experiment, we consider the LAD-FLASSO prob- 
lem, where the loss function in ([T]) is defined by Fi{yi, (3i) = \yi — (3i\. We only 
illustrate here with a sequence of length n = 100 and the noise has t distribution 
with 2 degrees of freedom and the scale parameter equal to 0.3. With heavy-tailed 
noises, the LAD-FLASSO is expected to perform better than the usual FLSA with 
quadratic loss. Indeed, Figure [2] shows the noisy sequence, the true signal, as well as 
the two reconstructions. The regularization parameters Ai and A2 in the two cases are 
those minimizing sum of squared errors and sum of absolute deviations respectively 
(of course this depends on the knowledge of the true signal in the simulation), by 
searching over a fine grid. An obvious difference between the two reconstructions is 
seen at positions 70-80, where an extremely high value of observation occurs due to 
the heavy-tailed noise distribution. 
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Figure 2: Reconstruction of a signal sequence based on FLSA with quadratic loss and 
absolute deviation (Li) loss. 
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4 Conclusion 

In this paper we propose a simple algorithm for the FLSA problem. Although not as 
fast as the path algorithm implemented in the flsa package, the most attractive feature 
of this algorithm is the simplicity of its implementation and it works for general con- 
vex loss functions. However, the computational speed of the current implementation 
in R can possibly be improved if a more general programming language such as C is 
used for its underlying implementation. Another advantage of the algorithm is that it 
is provably convergent for any initialization values, and its convergence properties are 
investigated based on simulation studies presented here. The flexibility in implemen- 
tation is demonstrated by our implementation of the LAD-FLASSO problem which 
is lacking from other existing implementations based on either descent algorithm or 
path algorithm. We expect that ALM as a general technique will be very useful for 
computing other optimization problems in statistical learning. 
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Supplementary Material 
Appendix A Proof of Theorem 1. 

In the proof we use matrix and vector notations. In particular, the expressions 6i — 
Pi + Pi-i, i = 2, . . . ,n can be written as 6* — AP with A an {n — 1) x n matrix. We 
also make frequent use of s ome standar d and cl assical results from co n vex a nalysis, 



such as those contained in 



Rockafellaj fll970h : Ekeland and Turnbulll f ll983h . most 



notably the properties of sub differential for convex functions. Also, we only show the 
convergence of Algorithms 1 and 2 while the analysis for Algorithms 3 and 4 is very 
much the same but more tedious to write down and thus omitted. 



We start with Algorithm 1, for which the augmented Lagrangian can be written 
as 

£(/3, e, u) = f/(/3) + Vie) + ^\\e-Ap\\' + v'iQ - A/3), 

where [/(/3) = YTi=\ PAvu A) + Ai YTi=\ lAI; ^(^) = ^2 YJi=2 l^^l' ^^^ v^ is the trans- 
pose of the column vector v. In the proof we only need to use the convexity of \J and 
V. 

Using the usual notation, suppose (/?*, d* , z/*) is the saddle point of £ satisfying 

£(/?*, Q\ v) < £(/?*, e*, V*) < /:(/3, e, u*) v/3, e, v (5) 

From the first equality of (|5]), we have Q* = A/3*. The update for u in Algorithm 
1 is i^'^ = i'''^^ + c{9'' — A/3'^), which implies 

ij'' = ij''-^ + c{e''-A^''), (6) 

where we set ^^ = 13^ — (3*, 6^ = 0^ — 9* and u^ = u^ — u*. From ([H]), we immediately 
get 

||-fc-i||2 _ ||-fc||2 _ _2c(p'^-i)^(^^ - AI3'') - c^\\e^ - A/^^W". 

Next we show the right hand side of the above is nonnegative. 
From the second inequality of ([5]), we have 

OedpC{(3*,e*,u*) ^ OedU{(3*)-cA'^{e* -Ap*)-u*^A, (7) 

OedeC{f3*,d*,u*) ^ OedV{e*) + c{e* -Ap*) + u*, (8) 

where d is the notation for the subdifferential of a convex function. 
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Correspondingly, based on the update of f3^ and 6'^ in Algorithm 1, we have 

0edpC{(3\e\u''-') ^ 0edUi(3'')-cA^{e''-Ap'')-{u''-YA (9) 
0edeCi(3\e\u^'') ^ 0edVi9'') + c{9''-Ap'') + u''-\ (10) 

Subtracting (^^ from ([9]) and subtracting (IHl) from ( TTOl) . we get 

e dU{p'')-dUi(3*)-cA^{9''-A(3'')-iu''-'fA (H) 

e dv{e'')-dv{e*) + c{e''-Ap'') + u''"\ (12) 

Multiplying {(3'')'^ to ([II]) from the left, multiplying (9'')'^ to (^ from the left, 
and adding the two expressions gives us 

e {du{(3^)-du{(3*)J^) + {dv{e^)-dv{e*), e^)+c\\e^-A^^\\'^+{v^-^Y{e^-A^^), 

(13) 
where we used (■, ■) to denote the dot product of two vectors i n some places above to 



be con sistent with the usual notation in convex analysis as in 



Ekeland and TurnbuU 



(Il983h . 

Prom standard results in convex analysis, all elements in ((9f/(/3^) — dU{(5*),(5^) 
and {dV{d^)—dV{d*), 6^) are nonnegative and thus we get c\\9^—AI3^\\^ + {v^^''Y{6^ — 
Afi^) < which immediately implies that 

||-fc-i||2 _ ||-/=||2 _ ^2c{iy^'Y{e^ - AI3'') - c^\\e^ - Al3^\\^ > c^\\e^ - Al3^\\\ 

Now that II^^^IP is nonnegative and decreasing, we obtain 6'' — Ap^ — )■ 0. Using 
this in (fl3!) . we get 

< {dU{f3'') - dU{f3*)J'') -> 0, < {dV{e'') - dV{e*), 6^) -^ 0, (14) 
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where the above expression is taken to mean that "there exists some sequence Uk G 
{dU{(3^) — dU{f3*),(3'') with < u^ — )■ 0", for example. Similar interpretations are 
used in the following. 

By the definition of subdifferential, we have 

U{(3') > U{(3*) + {dUmJ'), 



resulting in 

[/(Z^'^) - (9f/(/3*) J') > U{/3*)>U{l3'')-{dU{P^)J''). 

Using ( fT4|) . the difference between and left hand side and the right hand side is 
converging to zero and thus we have U{/3^) — )■ U{P*). Similarly we can show V{6'') — > 
V{6*). These combined with O'^ — AJS'' -4 prove the convergence of Algorithm 1. 

For Algorithm 2, the proof strategy is similar and we only point out the differences. 
The proof is the same as before up to equation ([H]). Because the order of update of f3 
and 9 in Algorithm 2, equation (jH]) is replaced by 

G dfsC{(3\ e''-\ z/'^-i) ^ G dUi(3'') - cA'^{9''-^ - A(3'') - {u^^YA, 

and thus equation (TTT1) becomes instead 

G dU{f3'')-dU{f3*)~cA'^{9''-^-A^'')-{u'"YA, 
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while equation (IT2|) remains the same. Then we have, in place of flT3|) . 

e {du{/3'')-du{f3*),p'') + {dv{e'')-dv{e*),e'') 

+c\\e^ - Al3^\\^ + {u'^-Yie^ - AI3'') - c{l3''fA^{e^-^ - ^^), 
which then implies 

Wi^'^Y - Wi^'W > cV - Ai3'\\' -2c''ip'fA^{e'-' - e''). (15) 

So the difference from the corresponding analysis for Algorithm 1 is the extra term 
—2c^{/3'')'^A'^{9^^^ — 9'') on the right hand side above. 

Now we analyze the term 2c^ [Ji^Y A^ [9^ — 9^~^). From (fT2|) (which is still true 
for Algorithm 2) and the update rule for v in Algorithm 2, we have 

e dV{9^)-dV{9*) + c{9^ -Ap^) + v^-^ (16) 

G dV{9^-^)-dV{9*) + c{9^-^-Ap^-^) + v^-^ (17) 

-k^i_-k-2 ^ c{9^-^ - AP^-^). (18) 

Subtracting (fT7|) from (fT6|) and taking into account (fTHj) . we get 

G 9y(^'^) - dV{9^-^) + c(^^ - A;^'^). 

Taking inner product with 9^ — 9^^^ in the above equation and using the property of 
convex function that {dV{9^) - dV{9''-^), 9'' - 9^~^) > 0, we get 

{§'' - A^''f{9'' - 9^-^) < 0, 
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and we can rewrite the above expression as 

Using the identity {O'^f {6'' - 6''-^) = 1/2(||^'=||2 - ||^*=-i||2 + H^'^ - ^fc-i||2) we obtain 
from flTSl) 

\\u''-^\\^ - iiz/'^ip > c^we'' - A(3''\\^ + c^i\\9''\\^ - \\e''-^\\'^ + \\e''- e'^-^)- 

After rearranging, we get 



and then \\9'' — A(3''\\ — > and \\9'' — 9'^ ^\\ ^ 0. Now the rest of the analysis follows 
that for Algorithm 1 with no changes. 

Appendix B R code for FLSA with quadratic loss 

f lasso. alm<-f unction (y,lambdal,lambda2,C=5,tol=le- 10) { 
n<-length(y) 

#initialization 

beta<-y 

theta<-rep(0,n-l) 

gainma<-rep (0 , n) 

mu<-rep(0,n) 

nu<-rep(0,n-l) 
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conv<-100 
iter<-0 

while (conv>tol){ 

temp<- (y+C*beta-mu/2) / (1+C) 
gainma<-abs (temp) -lambdal/ ( 1+C) 
gamma<-pmax (0 , ganima) *sign (temp) 

##compute rhs of the linear system for solving beta 

tempK — C*gamma; temp2< — mu; temp3<-c(theta[l] ,diff (theta) ,-theta[n-l] )*C; 

temp4<-c (nu [1] , dif f (nu) , -nu [n-1] ) ; rhs<-templ+temp2+temp3+temp4 

##compute the three diagonals in the linear system 

diagl<-rep(-C/2,n-l) ; 

diag2<-c (C/2 , rep (C , n-2) , C/2) +rep (C/2 , n) 

##call the tridiagonal matrix algorithm 

beta<-Solve . tridiag(diagl , diag2 , diagl , -rhs/2) 

temp<-dif f (beta) -nu/b 
theta<-abs (temp) -lainbda2/b 
theta<-pmax (0 , theta) *sign (temp) 

premu<-mu 

mu<-mu+C* (gamma-beta) 

prenu<-nu 

nu<-nu+C* (theta-dif f (beta) ) 



24 



conv<-niean(c((nu-prenu)"2, (mu-premu)~2)) #used to test convergence 
iter<-iter+l 
}-#while loop end 

#return the estimated signal and number of iterations performed 

list (beta=beta, iter=iter) 

} 
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